Welcome to MUSA 620:
Geospatial Data Science in Python

My day job

  • My name is Nick Hand
  • I am the Director of Finance, Data, and Policy for City Controller Rebecca Rhynhart
  • Goals of our team:
    • objective, data-driven analysis of financial policies impacting Philadelphia
    • increasing transparency through data releases and interactive reports

Previously:
Astrophysics Ph.D. at Berkeley


How did I get here?

  • Astrophysics to data science is becoming increasingly common
  • Landing a job through Twitter: https://www.parkingjawn.com
    • Dashboard visualization of monthly parking tickets in Philadelphia
    • Data from OpenDataPhilly

Goal: exploring and extracting insight from complex datasets

Parking Jawn

Class logistics

General info

  • 3-hour classes — mix of lecturing, interactive demos, and in-class lab time
  • short break about halfway through
  • My email: nhand@design.upenn.edu
    • Office Hours:
      • 2 hours during the week
      • Time is TBD
  • Teaching Assistant: Chloe Sheen
    • Email: csheen1@sas.upenn.edu
    • Office hours: TBD

Course GitHub

Piazza

Important: don't post lengthy screenshots of your code publicly — use a private post and we will respond

The goals of this course

  • Provide students with the knowledge and tools to turn data into meaningful insights and stories
  • Focus on the modern data science tools within the Python ecosystem
  • The pipeline approach to data science:
    • gathering, storing, analyzing, and visualizing data to tell stories
  • Real-world applications of analysis techniques in the urban planning and public policy realm

What we'll cover

Module 1

Exploratory Data Science: Students will be introduced to the main tools needed to get started analyzing and visualizing data using Python

Module 2

Introduction to Geospatial Data Science: Building on the previous set of tools, this module will teach students how to work with geospatial datasets using a range of modern Python toolkits.

Module 3

Data Ingestion & Big Data: Students will learn how to collect new data through web scraping and APIs, as well as how to work effectively with the large datasets often encountered in real-world applications.

Module 4

Geospatial Data Science in the Wild: Armed with the necessary data science tools, students will be introduced to a range of advanced analytic and machine learning techniques using a number of innovative examples from modern researchers.

Module 5

From Exploration to Storytelling: The final module will teach students to present their analysis results using web-based formats to transform their insights into interactive stories.

Assignments and grading

  • Grading:
    • 50% homework
    • 40% final project
    • 10% participation (based on both in-class participation and on Piazza)
  • Late policy: accepted up to 1 week late with a 10% deduction

Homeworks will be assigned (roughly) every two weeks. You must complete five of the seven homework assignments. Four of the assignments are required, and you are allowed to choose the last assignment to complete (out of the remaining three options).


Final Project

The final project is to replicate the pipeline approach on a dataset (or datasets) of your choosing.

Students will be required to use several of the analysis techniques taught in the class and produce a web-based data visualization that effectively communicates the empirical results to a non-technical audience.

Okay, let's get started...

The Incredible Growth of Python

A StackOverflow analysis

The rise of the Jupyter notebook

The engine of collaborative data science

  • First started by a physics grad student around 2001
  • Originally known as the IPython notebook
  • Started getting popular in ~2011
  • First funding received in 2015 $\rightarrow$ the Jupyter notebook was born

Google searches for Jupyter notebook


Key features

  • Aimed at "computational narratives" — telling stories with data
  • interactive, reproducible, shareable, user-friendly, visualization-focused

Very versatile: good for both exploratory data analysis and polished finished products

Beyond the Jupyter notebook

Google's Colaboratory


  • A fancier notebook experience built on top of Jupyter notebook
  • Running in the cloud on Google's servers
  • An internal Google product that was recently released publicly
  • Very popular for Python-based machine learning
  • We won't need to use it much in this course

Binder allows you to launch a repository of Jupyter notebooks on GitHub in the cloud

Note: as a free service, it can be a bit slow sometimes

Weekly lectures are available on Binder


Weekly Workflow

  1. Set up your local Python environment as part of the first homework assignment (week 1)
  2. Follow along with lectures using Binder in the cloud
  3. Install new Python dependencies each week into your local environment so you can work on homeworks locally

Now to the fun stuff...

These slides are a Jupyter notebook.

A mix of code cells and text cells in Markdown. Change the type of cell in the top menu bar.

In [21]:
# A simple code cell
# SHIFT-ENTER to execute

x = 10
print(x)
10

Python data types

In [22]:
# integer
a = 10

# float
b = 10.5

# string
c = "this is a test string"

# lists
d = list(range(10))

# booleans 
e = True

# dictionaries
f = {'key1': 1, "key2": 2}
In [23]:
print(a)
print(b)
print(c)
print(d)
print(e)
print(f)
10
10.5
this is a test string
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
True
{'key1': 1, 'key2': 2}

Alternative method for creating a dictionary

In [24]:
f = dict(key1=1, key2=2, key3=3)

Accessing dictionary values

In [25]:
# access the value with key 'key1'
f['key1']
Out[25]:
1

Accessing list values

In [26]:
# access the second list entry (0 is the first index)
d[1]
Out[26]:
1

Accessing characters of a string

In [27]:
# the first character
c[0]
Out[27]:
't'

Iterators and for loops

In [28]:
# Python code
result = 0
for i in range(100):
    result = result + i
In [29]:
print(result)
4950

Python's inline syntax

In [30]:
a = range(10) # this is an iterator
In [31]:
print(a)
range(0, 10)
In [32]:
# convert it to a list explicitly
a = list(range(10))
print(a)
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
In [33]:
# or use the INLINE syntax; this is the SAME
a = [i for i in range(10)]
print(a)
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

Python functions

def function_name(arg1, arg2, arg3):

    .
    .
    .
    code lines (indented)
    .
    .
    .

    return result
In [34]:
def compute_square(x):
    
    return x * x
In [35]:
sq = compute_square(5)
print(sq)
25

Keyword arguments: arguments with a default!

In [36]:
def compute_product(x, y=5):
    
    return x * y
In [37]:
# use the default value for y
print(compute_product(5))
25
In [38]:
# specify a y value other than the default
print(compute_product(5, 10))
50
In [39]:
# can also explicitly tell Python which arguments are which
print(compute_product(5, y=2))
print(compute_product(x=5, y=2))
10
10
In [40]:
# argument names must match the function signature though!
print(compute_product(5, z=5))
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-40-5db83a932315> in <module>
      1 # argument names must match the function signature though!
----> 2 print(compute_product(5, z=5))

TypeError: compute_product() got an unexpected keyword argument 'z'

Getting help in the notebook

Use tab auto-completion and the ? and ?? operators

In [41]:
this_variable_has_a_long_name = 5
In [47]:
# try hitting tab after typing this_ 
this_variable_has_a_long_name
Out[47]:
5
In [48]:
# try typing "r" and then tab
In [49]:
# Forget how to create a range? --> use the help message
range?

Peeking at the source code for a function

Use the ?? operator

In [50]:
compute_product??

The question mark operator gives you access to the help message for any variable or function. Very useful!

Getting more Python help

This was a very brief introduction. Additional Python tutorials listed on this week's README and on the course materials page.

The Data Science Handbook

The Python Data Science Handbook is a free, online textbook covering the Python basics needed in this course. In particular, the first four chapters are excellent:

Note that you can click on the "Open in Colab" button for each chapter and run the examples interactively using Google Colab.

Python Data Analysis: the pandas package

Documentation is available at https://pandas.pydata.org

The following line imports the pandas package:

In [51]:
import pandas as pd

Basic pandas concepts

The primary objects in pandas are the:

  • DataFrame, which is very similar to an Excel spreadsheet, and has rows and named columns
  • Series, which represents a single column of data. A DataFrame contains one or more Series and a name for each Series.

The data frame is a commonly used abstraction for data manipulation. Similar implementations exist in R.

You can think of Series objects as fancier versions of Python's built-in list data type.

To create a Series object:

In [52]:
# city names and population
city_names = pd.Series(['San Francisco', 'San Jose', 'Sacramento'])
population = pd.Series([852469, 1015785, 485199]) 
In [53]:
city_names
Out[53]:
0    San Francisco
1         San Jose
2       Sacramento
dtype: object

DataFrame objects can be created by passing a dict mapping string column names to their respective Series.

In [54]:
df = pd.DataFrame({ 'City Name': city_names, 'Population': population })
df
Out[54]:
City Name Population
0 San Francisco 852469
1 San Jose 1015785
2 Sacramento 485199

Accessing data

In [55]:
# access columns with a dict-like syntax
df['Population']
Out[55]:
0     852469
1    1015785
2     485199
Name: Population, dtype: int64

Rows can be accessed using Python's syntax for slicing

In [56]:
# list slicing: get the elements with indices 1 and 2 (but NOT 3)
cities = ['San Francisco', 'San Jose', 'Sacramento']
cities[1:3]
Out[56]:
['San Jose', 'Sacramento']
In [57]:
# slicing data frame rows is very similar!
df[1:3]
Out[57]:
City Name Population
1 San Jose 1015785
2 Sacramento 485199

pandas includes functionality for many different ways of selecting data. See the documentation for many more examples.
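Before moving on, here is a minimal sketch summarizing the most common selection patterns (rebuilding the same toy cities data frame as above so the snippet stands alone):

```python
import pandas as pd

df = pd.DataFrame({
    "City Name": ["San Francisco", "San Jose", "Sacramento"],
    "Population": [852469, 1015785, 485199],
})

col = df["City Name"]                 # a single column (returns a Series)
rows = df[1:3]                        # a row slice, just like a list
by_label = df.loc[0, "City Name"]     # by index label and column name
by_position = df.iloc[0, 0]           # by integer position (row 0, column 0)
big = df.loc[df["Population"] > 1e6]  # boolean selection

print(by_label, by_position)
```

Here `.loc[]` and `.iloc[]` agree because the default index labels are just the integers 0, 1, 2 — we'll see later that they can differ once rows are filtered.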

Manipulating data

In [58]:
df['Population'].median()  
Out[58]:
852469.0

NumPy is a popular toolkit for scientific computing.

pandas Series can be used as arguments to most NumPy functions:

In [59]:
import numpy as np

# calculate the median population value
np.median(df['Population'])
Out[59]:
852469.0

For more complex single-column transformations, you can use Series.apply. It accepts a function that is applied to each value in the Series.

For example, we can find which cities have a population greater than a million:

In [61]:
# define our function
def get_large_cities(population):
    return population > 1e6

large_cities = df['Population'].apply(get_large_cities)
In [62]:
large_cities
Out[62]:
0    False
1     True
2    False
Name: Population, dtype: bool
In [64]:
# add the new computed column to our original data frame
df['Large Cities'] = large_cities

# we can also use lambda (unnamed, inline) functions
df['Small Cities'] = df['Population'].apply(lambda population: population < 1e6)

# print out 
df
Out[64]:
City Name Population Large Cities Small Cities
0 San Francisco 852469 False True
1 San Jose 1015785 True False
2 Sacramento 485199 False True

Data selection

We can select the "large" cities by passing the boolean values to the .loc[] function of a DataFrame:

In [65]:
df['Large Cities']
Out[65]:
0    False
1     True
2    False
Name: Large Cities, dtype: bool
In [66]:
df.loc[df['Large Cities']]
Out[66]:
City Name Population Large Cities Small Cities
1 San Jose 1015785 True False

There is a faster, more concise way to achieve this!

In [67]:
df['Population'] > 1e6
Out[67]:
0    False
1     True
2    False
Name: Population, dtype: bool
In [68]:
# this is equivalent to doing 
df.loc[df['Population'] > 1e6]
Out[68]:
City Name Population Large Cities Small Cities
1 San Jose 1015785 True False

The documentation is your best friend! How do you get help?

In [69]:
# use the question mark
pd.DataFrame.loc?

Now onto a more interesting example...

Let's load census data on rental rates from the 2017 American Community Survey.

We've stored the data locally in the comma-separated value (CSV) format. Later in the course, you'll learn how to download them directly using Python.

Use the read_csv() function from pandas. The first argument to read_csv() is the file path to load.

In [70]:
df = pd.read_csv("./data/census/ACS_17_1YR_B25003.csv")
In [71]:
df.head()
Out[71]:
GEO.id GEO.id2 GEO.display-label HD01_VD01 HD02_VD01 HD01_VD02 HD02_VD02 HD01_VD03 HD02_VD03
0 1600000US0107000 107000 Birmingham city, Alabama 90186 3374 40295 2507 49891 3372
1 1600000US0121184 121184 Dothan city, Alabama 25192 875 14935 763 10257 722
2 1600000US0135896 135896 Hoover city, Alabama 34058 1330 22913 1724 11145 1482
3 1600000US0137000 137000 Huntsville city, Alabama 84750 2687 46677 2948 38073 2680
4 1600000US0150000 150000 Mobile city, Alabama 78145 2673 41297 2323 36848 2325

Select certain columns and rename them:

In [72]:
# columns we want
columns = ['GEO.display-label', 'HD01_VD01', 'HD01_VD03']
df2 = df[columns].copy()

# rename columns
df2.columns = ['City', 'Total', 'Rentals']

df2.head()
Out[72]:
City Total Rentals
0 Birmingham city, Alabama 90186 49891
1 Dothan city, Alabama 25192 10257
2 Hoover city, Alabama 34058 11145
3 Huntsville city, Alabama 84750 38073
4 Mobile city, Alabama 78145 36848
In [73]:
df2['Rental Rate'] = df2['Rentals'] / df2['Total'] * 100
In [74]:
df2.head()
Out[74]:
City Total Rentals Rental Rate
0 Birmingham city, Alabama 90186 49891 55.320116
1 Dothan city, Alabama 25192 10257 40.715306
2 Hoover city, Alabama 34058 11145 32.723589
3 Huntsville city, Alabama 84750 38073 44.923894
4 Mobile city, Alabama 78145 36848 47.153369
In [75]:
len(df2)
Out[75]:
608

We can select the largest cities by population using the Series.isin() function, which checks if each value in the Series is in the specified list.

In [76]:
top_cities = ['Philadelphia city, Pennsylvania', 
              'New York city, New York', 
              'Los Angeles city, California',
              'Houston city, Texas', 
              'Chicago city, Illinois', 
              'Phoenix city, Arizona']

selection = df2['City'].isin(top_cities) # which rows are valid
df2.loc[selection] # select the valid rows
Out[76]:
City Total Rentals Rental Rate
18 Phoenix city, Arizona 559155 245335 43.876027
94 Los Angeles city, California 1384851 878061 63.404727
275 Chicago city, Illinois 1047695 577430 55.114322
426 New York city, New York 3159674 2127881 67.344954
474 Philadelphia city, Pennsylvania 606142 309279 51.024182
525 Houston city, Texas 837686 479409 57.230155

An aside about string formatting

Python uses the "%" operator to insert variable values into strings. For example, the file name of the data we want to load for 2017 is:

In [77]:
year_str = "17"
path = "./data/census/ACS_%s_1YR_B25003.csv" % year_str
print(path)
./data/census/ACS_17_1YR_B25003.csv

See this guide on this type of string formatting for more info.

Python f-strings

Python version 3.6 introduced a new and improved string formatting syntax, where the variable that is being inserted can be referenced directly using the variable name.

The variable name should be enclosed in curly braces inside the string:

In [78]:
year_str = "17"
path = f"./data/census/ACS_{year_str}_1YR_B25003.csv"
print(path)
./data/census/ACS_17_1YR_B25003.csv

For more info: see this guide

Now let's load data from 2005 to 2017:

In [79]:
data = []
for year in range(2005, 2018):
    
    # read data for this year
    year_str = str(year) # convert integer value of "year" to a string
    year_str = year_str[2:] # extract the last two digits of the year string
    df = pd.read_csv(f"./data/census/ACS_{year_str}_1YR_B25003.csv")
     
    # columns we want
    columns = ['GEO.display-label', 'HD01_VD01', 'HD01_VD03']
    df = df[columns]

    # rename columns
    df.columns = ['City', 'Total', 'Rentals']
    
    # calculate the rental rate
    df['Rental Rate'] = df['Rentals'] / df['Total'] * 100
    
    # select the cities we want
    selection = df['City'].isin(top_cities)
    df = df.loc[selection]
    
    # add a column for the year
    df['Year'] = year
    
    # save 
    data.append(df)

# combine all of the data frames into one along the row axis
data = pd.concat(data, axis=0)
In [80]:
data.head()
Out[80]:
City Total Rentals Rental Rate Year
13 Phoenix city, Arizona 503753 204914 40.677475 2005
75 Los Angeles city, California 1284124 771325 60.066240 2005
218 Chicago city, Illinois 1020605 525620 51.500825 2005
349 New York city, New York 3026196 2023924 66.880136 2005
393 Philadelphia city, Pennsylvania 565433 244792 43.292839 2005

Basic plotting: matplotlib

Much more to come next week. For now, we'll use some very simple plotting commands with Python's main plotting library, matplotlib.

Import the library we'll use from matplotlib.

In [81]:
from matplotlib import pyplot as plt

We'll use the plt.plot() function which plots a simple x vs y line.

In [82]:
with plt.style.context('fivethirtyeight'):
    for city in data['City'].unique():
        
        print(city)
        
        this_city = data['City'] == city
        df = data.loc[this_city]
        plt.plot(df['Year'], df['Rental Rate'], label=city)

# format
plt.legend(loc=0, ncol=3)
ax = plt.gca()
ax.set_ylim(30, 72)
ax.set_ylabel("Rental Rate in Percent")
ax.figure.set_size_inches((10, 6))
Phoenix city, Arizona
Los Angeles city, California
Chicago city, Illinois
New York city, New York
Philadelphia city, Pennsylvania
Houston city, Texas

From 2005 to 2017, Philadelphia had the largest percent increase

Exercise: can you calculate what the increase was for Philadelphia?

Exercise solution

In [83]:
# trim to just Philadelphia first, using this boolean selection index
data['City']=='Philadelphia city, Pennsylvania'
Out[83]:
13     False
75     False
218    False
349    False
393     True
       ...  
94     False
275    False
426    False
474     True
525    False
Name: City, Length: 78, dtype: bool
In [84]:
# select Philadelphia only
philly = data.loc[data['City']=='Philadelphia city, Pennsylvania']
philly
Out[84]:
City Total Rentals Rental Rate Year
393 Philadelphia city, Pennsylvania 565433 244792 43.292839 2005
400 Philadelphia city, Pennsylvania 554048 231323 41.751437 2006
408 Philadelphia city, Pennsylvania 562384 239363 42.562199 2007
415 Philadelphia city, Pennsylvania 578263 253084 43.766245 2008
419 Philadelphia city, Pennsylvania 569835 258459 45.356814 2009
427 Philadelphia city, Pennsylvania 575413 264129 45.902508 2010
434 Philadelphia city, Pennsylvania 576429 265126 45.994563 2011
437 Philadelphia city, Pennsylvania 579874 277323 47.824700 2012
449 Philadelphia city, Pennsylvania 582528 285430 48.998503 2013
454 Philadelphia city, Pennsylvania 577862 277301 47.987409 2014
458 Philadelphia city, Pennsylvania 581604 275397 47.351291 2015
465 Philadelphia city, Pennsylvania 580205 277707 47.863600 2016
474 Philadelphia city, Pennsylvania 606142 309279 51.024182 2017

An aside about labels and the index in pandas

Note how the first row of the above dataframe has an index label of 393. This means we can access it using the .loc[] function like:

In [85]:
philly.loc[393]
Out[85]:
City           Philadelphia city, Pennsylvania
Total                                   565433
Rentals                                 244792
Rental Rate                            43.2928
Year                                      2005
Name: 393, dtype: object

This is different from the .iloc[] function, which indexes based on the integer position of the row, regardless of what the labels in the index are. For example, the first row of a dataframe can always be accessed using:

In [86]:
philly.iloc[0]
Out[86]:
City           Philadelphia city, Pennsylvania
Total                                   565433
Rentals                                 244792
Rental Rate                            43.2928
Year                                      2005
Name: 393, dtype: object

We can reset the index labels so they range from 0 to the length of the dataframe, using the reset_index() function. For example:

In [87]:
philly.reset_index(drop=True)
Out[87]:
City Total Rentals Rental Rate Year
0 Philadelphia city, Pennsylvania 565433 244792 43.292839 2005
1 Philadelphia city, Pennsylvania 554048 231323 41.751437 2006
2 Philadelphia city, Pennsylvania 562384 239363 42.562199 2007
3 Philadelphia city, Pennsylvania 578263 253084 43.766245 2008
4 Philadelphia city, Pennsylvania 569835 258459 45.356814 2009
5 Philadelphia city, Pennsylvania 575413 264129 45.902508 2010
6 Philadelphia city, Pennsylvania 576429 265126 45.994563 2011
7 Philadelphia city, Pennsylvania 579874 277323 47.824700 2012
8 Philadelphia city, Pennsylvania 582528 285430 48.998503 2013
9 Philadelphia city, Pennsylvania 577862 277301 47.987409 2014
10 Philadelphia city, Pennsylvania 581604 275397 47.351291 2015
11 Philadelphia city, Pennsylvania 580205 277707 47.863600 2016
12 Philadelphia city, Pennsylvania 606142 309279 51.024182 2017

Next, to calculate our percent difference we need to get the values for Philadelphia in 2005 and 2017:

In [88]:
# select 2005
x = philly.loc[philly['Year']==2005]
value_2005 = x['Rental Rate'].iloc[0]
print("2005 value = ", value_2005)
2005 value =  43.292839293072745
In [89]:
x['Rental Rate']
Out[89]:
393    43.292839
Name: Rental Rate, dtype: float64
In [90]:
# select 2017
y = philly.loc[philly['Year']==2017]
value_2017 = y['Rental Rate'].iloc[0]
print("2017 value = ", value_2017)
2017 value =  51.02418245229665
In [91]:
# calculate percent change
percent_change = (value_2017 - value_2005)/ value_2005 * 100
In [92]:
print("change in rental rate in Philadelphia from 2005 to 2017 = ", percent_change)
change in rental rate in Philadelphia from 2005 to 2017 =  17.85824927509661

For more details on the .iloc[] function, see the documentation on indexing by position.

Now let's compare Zillow rental and home value data

Load citywide Zillow Rent Index (ZRI) and Zillow Home Value Index (ZHVI) data.

Files were downloaded earlier from https://www.zillow.com/research/data/

In [93]:
home_values = pd.read_csv("./data/zillow/Metro_Zhvi_AllHomes.csv", engine='python')
rent_values = pd.read_csv('./data/zillow/Metro_Zri_SingleFamilyResidenceRental.csv', engine='python')

Peek at the first few rows of the ZRI data:

In [94]:
rent_values.head()
Out[94]:
RegionID RegionName SizeRank 2010-11 2010-12 2011-01 2011-02 2011-03 2011-04 2011-05 ... 2018-02 2018-03 2018-04 2018-05 2018-06 2018-07 2018-08 2018-09 2018-10 2018-11
0 102001 United States 0 1250.0 1248.0 1245.0 1240.0 1236.0 1234.0 1236.0 ... 1411 1412 1414 1416 1416 1415 1415 1414 1417 1426
1 394913 New York, NY 1 NaN NaN NaN NaN NaN NaN NaN ... 2422 2417 2417 2418 2417 2416 2415 2415 2425 2438
2 753899 Los Angeles-Long Beach-Anaheim, CA 2 2283.0 2272.0 2263.0 2259.0 2258.0 2263.0 2265.0 ... 2856 2866 2877 2886 2890 2892 2892 2893 2897 2911
3 394463 Chicago, IL 3 1581.0 1569.0 1560.0 1551.0 1541.0 1534.0 1530.0 ... 1698 1696 1696 1696 1695 1694 1694 1693 1697 1713
4 394514 Dallas-Fort Worth, TX 4 1297.0 1297.0 1297.0 1298.0 1298.0 1300.0 1304.0 ... 1624 1622 1622 1621 1619 1618 1618 1616 1620 1630

5 rows × 100 columns

And do the same for the ZHVI data:

In [95]:
home_values.head()
Out[95]:
RegionID RegionName SizeRank 1996-04 1996-05 1996-06 1996-07 1996-08 1996-09 1996-10 ... 2018-02 2018-03 2018-04 2018-05 2018-06 2018-07 2018-08 2018-09 2018-10 2018-11
0 102001 United States 0 100600.0 100600.0 100600.0 100700.0 100800.0 101000.0 101200.0 ... 211500 213200 214400 215600 216800 217900 219100 220200 221600 222800
1 394913 New York, NY 1 165000.0 164800.0 164600.0 164300.0 164100.0 163900.0 163800.0 ... 420800 423400 425300 426600 427600 428600 429500 431100 433500 435700
2 753899 Los Angeles-Long Beach-Anaheim, CA 2 170600.0 170400.0 170100.0 169800.0 169500.0 169300.0 169200.0 ... 635900 639000 640600 641900 643200 644700 646400 647500 648500 650400
3 394463 Chicago, IL 3 138300.0 138900.0 139300.0 139400.0 139300.0 138800.0 138000.0 ... 218000 219300 219900 220500 220900 221100 221500 222100 222800 223600
4 394514 Dallas-Fort Worth, TX 4 115300.0 115400.0 115500.0 115700.0 115900.0 116100.0 116400.0 ... 222600 225500 227000 228400 230200 231700 232600 233800 235900 237800

5 rows × 275 columns

Selecting the cities we want

In [96]:
valid_cities = ['New York, NY', 'Chicago, IL', 
                'Los Angeles-Long Beach-Anaheim, CA', 
                'Philadelphia, PA', 'Houston, TX', 'Phoenix, AZ']
In [97]:
selection = home_values['RegionName'].isin(valid_cities)
home_values_trimmed = home_values.loc[selection]
In [98]:
selection = rent_values['RegionName'].isin(valid_cities)
rent_values_trimmed = rent_values.loc[selection]
In [99]:
rent_values_trimmed
Out[99]:
RegionID RegionName SizeRank 2010-11 2010-12 2011-01 2011-02 2011-03 2011-04 2011-05 ... 2018-02 2018-03 2018-04 2018-05 2018-06 2018-07 2018-08 2018-09 2018-10 2018-11
1 394913 New York, NY 1 NaN NaN NaN NaN NaN NaN NaN ... 2422 2417 2417 2418 2417 2416 2415 2415 2425 2438
2 753899 Los Angeles-Long Beach-Anaheim, CA 2 2283.0 2272.0 2263.0 2259.0 2258.0 2263.0 2265.0 ... 2856 2866 2877 2886 2890 2892 2892 2893 2897 2911
3 394463 Chicago, IL 3 1581.0 1569.0 1560.0 1551.0 1541.0 1534.0 1530.0 ... 1698 1696 1696 1696 1695 1694 1694 1693 1697 1713
5 394974 Philadelphia, PA 5 1557.0 1567.0 1572.0 1576.0 1574.0 1574.0 1573.0 ... 1690 1689 1689 1688 1686 1685 1684 1683 1686 1696
6 394692 Houston, TX 6 1370.0 1366.0 1363.0 1361.0 1362.0 1362.0 1363.0 ... 1582 1583 1585 1587 1587 1587 1587 1587 1588 1591
14 394976 Phoenix, AZ 14 1207.0 1209.0 1204.0 1196.0 1188.0 1184.0 1184.0 ... 1394 1397 1401 1403 1402 1399 1397 1396 1402 1423

6 rows × 100 columns

Removing unwanted columns

Unwanted columns can be dropped from the data frame using the drop() function.

In [100]:
unwanted = ['SizeRank', 'RegionID']
home_values_final = home_values_trimmed.drop(unwanted, axis=1)
In [101]:
unwanted = ['SizeRank', 'RegionID']
rent_values_final = rent_values_trimmed.drop(unwanted, axis=1)
In [102]:
rent_values_final
Out[102]:
RegionName 2010-11 2010-12 2011-01 2011-02 2011-03 2011-04 2011-05 2011-06 2011-07 ... 2018-02 2018-03 2018-04 2018-05 2018-06 2018-07 2018-08 2018-09 2018-10 2018-11
1 New York, NY NaN NaN NaN NaN NaN NaN NaN NaN NaN ... 2422 2417 2417 2418 2417 2416 2415 2415 2425 2438
2 Los Angeles-Long Beach-Anaheim, CA 2283.0 2272.0 2263.0 2259.0 2258.0 2263.0 2265.0 2268.0 2270.0 ... 2856 2866 2877 2886 2890 2892 2892 2893 2897 2911
3 Chicago, IL 1581.0 1569.0 1560.0 1551.0 1541.0 1534.0 1530.0 1526.0 1526.0 ... 1698 1696 1696 1696 1695 1694 1694 1693 1697 1713
5 Philadelphia, PA 1557.0 1567.0 1572.0 1576.0 1574.0 1574.0 1573.0 1575.0 1577.0 ... 1690 1689 1689 1688 1686 1685 1684 1683 1686 1696
6 Houston, TX 1370.0 1366.0 1363.0 1361.0 1362.0 1362.0 1363.0 1362.0 1364.0 ... 1582 1583 1585 1587 1587 1587 1587 1587 1588 1591
14 Phoenix, AZ 1207.0 1209.0 1204.0 1196.0 1188.0 1184.0 1184.0 1187.0 1190.0 ... 1394 1397 1401 1403 1402 1399 1397 1396 1402 1423

6 rows × 98 columns

Wide vs long format for datasets

Currently, our data is in wide format $\rightarrow$ each observation has its own column. This usually results in many columns but few rows.

In [103]:
home_values_final
Out[103]:
RegionName 1996-04 1996-05 1996-06 1996-07 1996-08 1996-09 1996-10 1996-11 1996-12 ... 2018-02 2018-03 2018-04 2018-05 2018-06 2018-07 2018-08 2018-09 2018-10 2018-11
1 New York, NY 165000.0 164800.0 164600.0 164300.0 164100.0 163900.0 163800.0 163800.0 164000.0 ... 420800 423400 425300 426600 427600 428600 429500 431100 433500 435700
2 Los Angeles-Long Beach-Anaheim, CA 170600.0 170400.0 170100.0 169800.0 169500.0 169300.0 169200.0 169200.0 169400.0 ... 635900 639000 640600 641900 643200 644700 646400 647500 648500 650400
3 Chicago, IL 138300.0 138900.0 139300.0 139400.0 139300.0 138800.0 138000.0 137000.0 136100.0 ... 218000 219300 219900 220500 220900 221100 221500 222100 222800 223600
5 Philadelphia, PA 109700.0 109700.0 109600.0 109500.0 109500.0 109400.0 109400.0 109400.0 109500.0 ... 225500 226100 226400 226800 227400 228100 228700 229400 230300 231100
6 Houston, TX NaN NaN NaN NaN NaN NaN NaN NaN NaN ... 194300 195800 196800 197900 199000 199700 200400 201200 202200 203200
14 Phoenix, AZ 104800.0 105400.0 106100.0 106700.0 107300.0 107900.0 108500.0 109000.0 109600.0 ... 248700 250300 251600 252900 254500 256100 257300 258600 260500 262000

6 rows × 273 columns

Usually it's better to have data in tidy (also known as long) format.

Tidy datasets are arranged such that each variable is a column and each observation is a row.

In our case, we want to have a column called ZRI and one called ZHVI and a row for each month that the indices were measured.

pandas provides the melt() function for converting from wide formats to tidy formats.

In [104]:
pd.melt?

Now, let's melt our datasets:

In [105]:
ZHVI = pd.melt(home_values_final, id_vars=['RegionName'], 
                value_name='ZHVI', var_name='Date')
ZRI = pd.melt(rent_values_final, id_vars=['RegionName'], 
              value_name='ZRI', var_name='Date')

and take a look:

In [106]:
ZRI.head()
Out[106]:
RegionName Date ZRI
0 New York, NY 2010-11 NaN
1 Los Angeles-Long Beach-Anaheim, CA 2010-11 2283.0
2 Chicago, IL 2010-11 1581.0
3 Philadelphia, PA 2010-11 1557.0
4 Houston, TX 2010-11 1370.0
In [107]:
ZHVI.head()
Out[107]:
RegionName Date ZHVI
0 New York, NY 1996-04 165000.0
1 Los Angeles-Long Beach-Anaheim, CA 1996-04 170600.0
2 Chicago, IL 1996-04 138300.0
3 Philadelphia, PA 1996-04 109700.0
4 Houston, TX 1996-04 NaN

Note that missing data values are represented as NaN ("not a number")
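pandas ships with helpers for detecting and handling these NaN values. A minimal sketch, using a toy Series with made-up values (not the Zillow data itself):

```python
import numpy as np
import pandas as pd

# a toy Series with one missing value
s = pd.Series([1557.0, np.nan, 1370.0])

print(s.isna())     # boolean mask: True where the value is missing
print(s.dropna())   # drop the missing entries
print(s.fillna(0))  # or replace them with a fill value
print(s.mean())     # aggregations like mean() skip NaN by default
```

The last line is worth remembering: most pandas aggregations silently ignore NaN, so you usually don't need to drop missing values before computing statistics.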

Merging data frames

Another common operation is merging, also known as joining, two datasets.

We can use the merge() function to merge observations that have the same Date and RegionName values.

In [108]:
data = pd.merge(ZRI, ZHVI, on=['Date', 'RegionName'], how='outer')
In [109]:
data
Out[109]:
RegionName Date ZRI ZHVI
0 New York, NY 2010-11 NaN 358800.0
1 Los Angeles-Long Beach-Anaheim, CA 2010-11 2283.0 410300.0
2 Chicago, IL 2010-11 1581.0 182800.0
3 Philadelphia, PA 2010-11 1557.0 204400.0
4 Houston, TX 2010-11 1370.0 137500.0
... ... ... ... ...
1627 Los Angeles-Long Beach-Anaheim, CA 2010-10 NaN 412900.0
1628 Chicago, IL 2010-10 NaN 184700.0
1629 Philadelphia, PA 2010-10 NaN 205700.0
1630 Houston, TX 2010-10 NaN 138000.0
1631 Phoenix, AZ 2010-10 NaN 140800.0

1632 rows × 4 columns

Merging is very powerful and the merge can be done in a number of ways. See the infographic on joining in this repository.
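To make the options concrete, here is a small sketch of the `how` keyword using two toy data frames (the values are made up for illustration):

```python
import pandas as pd

# toy data frames: each has one region the other lacks
left = pd.DataFrame({"RegionName": ["New York, NY", "Chicago, IL", "Houston, TX"],
                     "ZRI": [2438, 1713, 1591]})
right = pd.DataFrame({"RegionName": ["New York, NY", "Chicago, IL", "Phoenix, AZ"],
                      "ZHVI": [435700, 223600, 262000]})

inner = pd.merge(left, right, on="RegionName", how="inner")      # only regions in BOTH frames
outer = pd.merge(left, right, on="RegionName", how="outer")      # all regions; NaN where missing
left_join = pd.merge(left, right, on="RegionName", how="left")   # every row of `left`, matched where possible

print(len(inner), len(outer), len(left_join))
```

We used how='outer' above so that no (Date, RegionName) observation is lost, at the cost of introducing NaN values where one index is missing.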

Datetime objects

Currently our Date column is stored as a string.

pandas includes additional functionality for dates, but first we must convert the strings using the to_datetime() function.

In [110]:
# convert the strings to datetime objects
data['Date'] = pd.to_datetime(data['Date'])

Quick trick: Series that hold Datetime objects have a dt attribute that lets you grab parts of the date easily.

For example, we can easily add new columns for the month and year using:

In [111]:
# Note that the dtype is now datetime64[ns]
data['Date'].head()
Out[111]:
0   2010-11-01
1   2010-11-01
2   2010-11-01
3   2010-11-01
4   2010-11-01
Name: Date, dtype: datetime64[ns]
In [112]:
data['Month'] = data['Date'].dt.month
data['Year'] = data['Date'].dt.year 
In [113]:
data.head()
Out[113]:
RegionName Date ZRI ZHVI Month Year
0 New York, NY 2010-11-01 NaN 358800.0 11 2010
1 Los Angeles-Long Beach-Anaheim, CA 2010-11-01 2283.0 410300.0 11 2010
2 Chicago, IL 2010-11-01 1581.0 182800.0 11 2010
3 Philadelphia, PA 2010-11-01 1557.0 204400.0 11 2010
4 Houston, TX 2010-11-01 1370.0 137500.0 11 2010
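The dt accessor exposes more than just month and year; attributes like dayofweek and quarter work the same way. A small standalone sketch (dates chosen for illustration, not from the lecture data):

```python
import pandas as pd

dates = pd.to_datetime(pd.Series(["2010-11", "2010-12"]))

print(dates.dt.year)       # 2010 for both entries
print(dates.dt.quarter)    # 4 for both entries
print(dates.dt.dayofweek)  # Monday=0 through Sunday=6
```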

Annual trends: grouping by Year

pandas is especially useful for grouping and aggregating data via the groupby() function.

From the pandas documentation, groupby means:

  • Splitting the data into groups based on some criteria.
  • Applying a function to each group independently.
  • Combining the results into a data structure.

The documentation is available here.

We can calculate annual averages for each year by grouping by the RegionName and Year columns and taking the mean of our desired column. For example:

In [114]:
# calculate mean values for each Year and City (RegionName)
annual_ZHVI = data.groupby(['RegionName', 'Year'])['ZHVI'].mean() 
annual_ZRI = data.groupby(['RegionName', 'Year'])['ZRI'].mean()
In [115]:
print(type(annual_ZHVI))
<class 'pandas.core.series.Series'>
In [116]:
annual_ZHVI.head()
Out[116]:
RegionName   Year
Chicago, IL  1996    138344.444444
             1997    130316.666667
             1998    137841.666667
             1999    147516.666667
             2000    159900.000000
Name: ZHVI, dtype: float64

Note that the result is indexed by the columns we grouped by (RegionName and Year).

We can reset the index so that the index values are listed as columns in the data frame again.

In [117]:
annual_ZHVI = annual_ZHVI.reset_index()
annual_ZRI = annual_ZRI.reset_index()
In [118]:
annual_ZHVI.head(n=50)
Out[118]:
RegionName Year ZHVI
0 Chicago, IL 1996 138344.444444
1 Chicago, IL 1997 130316.666667
2 Chicago, IL 1998 137841.666667
3 Chicago, IL 1999 147516.666667
4 Chicago, IL 2000 159900.000000
5 Chicago, IL 2001 172858.333333
6 Chicago, IL 2002 185350.000000
7 Chicago, IL 2003 197341.666667
8 Chicago, IL 2004 216033.333333
9 Chicago, IL 2005 237100.000000
10 Chicago, IL 2006 251775.000000
11 Chicago, IL 2007 251450.000000
12 Chicago, IL 2008 233575.000000
13 Chicago, IL 2009 206675.000000
14 Chicago, IL 2010 190841.666667
15 Chicago, IL 2011 171133.333333
16 Chicago, IL 2012 161150.000000
17 Chicago, IL 2013 168450.000000
18 Chicago, IL 2014 179666.666667
19 Chicago, IL 2015 188175.000000
20 Chicago, IL 2016 197633.333333
21 Chicago, IL 2017 208983.333333
22 Chicago, IL 2018 220536.363636
23 Houston, TX 1996 NaN
24 Houston, TX 1997 105920.000000
25 Houston, TX 1998 111800.000000
26 Houston, TX 1999 121250.000000
27 Houston, TX 2000 126775.000000
28 Houston, TX 2001 124141.666667
29 Houston, TX 2002 125650.000000
30 Houston, TX 2003 128583.333333
31 Houston, TX 2004 135741.666667
32 Houston, TX 2005 141791.666667
33 Houston, TX 2006 144033.333333
34 Houston, TX 2007 148366.666667
35 Houston, TX 2008 144950.000000
36 Houston, TX 2009 140841.666667
37 Houston, TX 2010 139341.666667
38 Houston, TX 2011 135175.000000
39 Houston, TX 2012 134516.666667
40 Houston, TX 2013 139483.333333
41 Houston, TX 2014 148350.000000
42 Houston, TX 2015 164275.000000
43 Houston, TX 2016 178841.666667
44 Houston, TX 2017 188108.333333
45 Houston, TX 2018 198454.545455
46 Los Angeles-Long Beach-Anaheim, CA 1996 169722.222222
47 Los Angeles-Long Beach-Anaheim, CA 1997 172133.333333
48 Los Angeles-Long Beach-Anaheim, CA 1998 188033.333333
49 Los Angeles-Long Beach-Anaheim, CA 1999 207566.666667
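As an aside, rather than two separate groupby calls, both columns can be aggregated in one pass and flattened immediately. A minimal sketch on toy data (the numbers are hypothetical, not the real Zillow values):

```python
import pandas as pd

df = pd.DataFrame({
    "RegionName": ["Chicago, IL", "Chicago, IL", "Houston, TX"],
    "Year": [1996, 1996, 1997],
    "ZHVI": [138000.0, 139000.0, 106000.0],
    "ZRI": [1500.0, 1520.0, 1100.0],
})

# mean of both value columns per (RegionName, Year), back to a flat frame
annual = (
    df.groupby(["RegionName", "Year"])[["ZHVI", "ZRI"]]
      .mean()
      .reset_index()
)
print(annual)
```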

Plotting our results: ZHVI

In [119]:
with plt.style.context('fivethirtyeight'):
    for city in annual_ZHVI['RegionName'].unique():
        selection = annual_ZHVI['RegionName'] == city
        df = annual_ZHVI.loc[selection]
        
        plt.plot(df['Year'], df['ZHVI']/1e3, label=city)
    
    ax = plt.gca()
    ax.set_ylim(50, 800)
    ax.legend(loc=0, ncol=2)
    ax.figure.set_size_inches((10, 6))
    ax.set_ylabel('Zillow Home Value Index\n(in thousands of dollars)')

Home values in Philadelphia have recently recovered to pre-2008 levels

Plotting the results: Zillow Rent Index

In [120]:
with plt.style.context('fivethirtyeight'):
    for city in annual_ZRI['RegionName'].unique():
        selection = annual_ZRI['RegionName'] == city
        df = annual_ZRI.loc[selection]
        
        plt.plot(df['Year'], df['ZRI'], label=city)
    
    ax = plt.gca()
    ax.set_ylim(1000, 3300)
    ax.legend(loc=0, ncol=2)
    ax.set_ylabel('Zillow Rent Index (in dollars)')
    ax.figure.set_size_inches((10, 6))

Rent prices in Philadelphia have stayed relatively flat compared to other large cities.

First Homework Assignment

Two parts:

  1. Download and install Anaconda locally
  2. Analyze Zillow ZHVI data by ZIP code in Philadelphia, submitting your Jupyter notebook

Due date: by the start of next class

That's it!

  • Stay tuned for announcement of office hours in the coming days
  • Post questions on Piazza
  • Email questions/concerns to nhand@design.upenn.edu